{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 图说Batch Normalization\n",
    "\n",
    "https://aibyhand.substack.com/p/9-can-you-calculate-batch-normalization\n",
    "\n",
    "下面是关于批量归一化（Batch Normalization）图例的描述的：\n",
    "\n",
    "[1] 给定\n",
    " 一个包含4个训练样本的小批量数据，每个样本有3个特征。\n",
    "\n",
    "[2] 线性层\n",
    " 通过与权重和偏置相乘来获得新的特征。\n",
    "\n",
    "[3] ReLU\n",
    " 应用ReLU激活函数，其效果是抑制负值。在这个练习中，-2 被设置为 0。\n",
    "\n",
    "[4] 批量统计\n",
    " 计算这个小批量数据中四个样本的总和、平均值、方差和标准差。\n",
    " 注意这些统计量是针对每一行（即，每个特征维度）计算的。\n",
    "\n",
    "[5] 移至均值 = 0\n",
    " 从每个训练样本的激活值中减去均值（绿色）\n",
    " 预期的效果是让每个维度中的4个激活值的平均值为零。\n",
    "\n",
    "[6] 缩放至方差 = 1\n",
    " 除以标准差（橙色）\n",
    " 预期的效果是让这4个激活值具有等于一的方差。\n",
    "\n",
    "[7] 缩放与平移\n",
    " 将步骤[6]中得到的标准化特征乘以一个线性变换矩阵，并将结果传递给下一层。\n",
    " 预期的效果是对标准化特征值进行缩放和平移，以获得一个新的均值和方差，这些是由网络学习得到的。\n",
    " 对角线元素和最后一列是网络将要学习的可训练参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"bn0.png\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import Image\n",
    "Image(url= \"bn0.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 批量归一化（Batch Normalization，简称BN）原理\n",
    "\n",
    "批量归一化（Batch Normalization，简称BN）是一种用于加速深度神经网络训练的技术，由Sergey Ioffe和Christian Szegedy在2015年提出。其主要目的是减少内部协变量偏移（internal covariate shift），使得网络能够更快地收敛，并且有助于减轻过拟合的问题。\n",
    "\n",
    "以下是批量归一化的详细介绍：\n",
    "\n",
    "### 基本概念\n",
    "\n",
    "批量归一化是在神经网络的隐藏层之间插入的一种规范化技术，它通过对网络每一层的输入进行规范化处理，确保这些输入具有零均值和单位方差。这种处理可以改善训练过程中的梯度流，并且有助于使每一层的学习任务更加稳定和简单。\n",
    "\n",
    "### 工作原理\n",
    "\n",
    "批量归一化的工作原理如下：\n",
    "\n",
    "1. **计算统计数据**：\n",
    "   - 在训练过程中，对于每一个mini-batch的数据，BN会计算该批次数据的均值和方差。\n",
    "   - 这些统计数据是基于当前mini-batch中的所有样本计算得出的。\n",
    "\n",
    "2. **标准化**：\n",
    "   - 使用上述计算出的均值和方差对mini-batch中的数据进行标准化，使其具有零均值和单位方差。\n",
    "   - 标准化公式可以表示为：$$ \\hat{x}^{(k)} = \\frac{x^{(k)} - \\mu^{(k)}}{\\sqrt{\\sigma^{(2)(k)} + \\epsilon}} $$，其中 $ x^{(k)} $ 是第k个样本，$ \\mu^{(k)} $ 和 $ \\sigma^{(2)(k)} $ 分别是均值和方差，$ \\epsilon $ 是一个小常数用来防止除以零的情况发生。\n",
    "\n",
    "3. **缩放和平移**：\n",
    "   - 在标准化之后，BN允许对数据进行缩放和平移，以便模型可以学习到适当的分布。\n",
    "   - 这是通过两个可学习的参数 γ（scale）和 β（shift）实现的：$$ y^{(k)} = \\gamma \\hat{x}^{(k)} + \\beta $$。\n",
    "\n",
    "### 优点\n",
    "\n",
    "- **加速训练**：BN减少了内部协变量偏移，从而加快了训练速度。\n",
    "- **正则化效果**：BN有一定的正则化作用，可以帮助减少过拟合的风险。\n",
    "- **更宽松的初始化要求**：使用BN后，模型对权重的初始值更加不敏感。\n",
    "- **可能允许更高的学习率**：由于BN稳定了训练过程，因此可以使用更高的学习率。\n",
    "\n",
    "### 测试阶段\n",
    "\n",
    "在测试阶段，我们不再使用mini-batch的统计数据来进行归一化，而是使用整个训练集的移动平均统计值来替代，这样可以更好地反映整个数据集的特性。\n",
    "\n",
    "需要注意的是，尽管BN在很多情况下表现出了显著的优势，但它并不是万能的解决方案，在某些特定的情况下可能不会带来明显的改进。此外，BN对某些类型的模型，如RNN，可能不如对CNN那么有效。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batch Normalization在PyTorch中的实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### nn.BatchNorm基本使用方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;在PyTorch中，我们使用nn.Linear构建线性层，类似的，我们通过使用nn.BatchNorm类来构建BN层，进而可以进行Batch Normalization归一化操作，并且在后续操作过程中我们会发现，BN层和线性层之间由诸多相似之处。同时，出于对不同类型数据的不同处理需求，nn.BatchNorm1d类主要用于处理2d数据，nn.BatchNorm2d则主要用来处理3d数据。简单来说，目前我们所使用的面板数据（二维表格数据）都属于2d数据，使用nn.BatchNorm1d处理即可。而后续在处理图像数据时，则需要视情况和使用nn.BatchNorm2d类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# 查看帮助\n",
    "#nn.BatchNorm1d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先我们解释其中三个核心参数："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- num_features：输入数据的特征数量（假设为n），也就是前一层神经元数量或原始数据集特征数量，根据此前的论述，BN层最终构建的是一个n*n的对角矩阵，对角线元素包含$\\gamma$，并且截距项为$\\beta$；\n",
    "- eps：方差分母修正项，为了防止分母为0，一般取值为1e-5，也就是类默认值；\n",
    "- affine：是否进行仿射变换，需要注意的是，此时进行仿射变换时将使用无偏估计进行期望和方差的计算，并且初始条件下$\\gamma=1，\\beta=0$，当参数取值为True时，会显式设置$\\gamma和\\beta$参数并带入进行梯度下降迭代计算，取值为False时，参数不显示，实际的数据归一化过程就是对原数据进行无偏估计下的Z-Score变换。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "简单进行尝试"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0., 1., 2.],\n",
       "        [3., 4., 5.],\n",
       "        [6., 7., 8.]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f = torch.arange(9).reshape(3, 3).float()\n",
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实例化一个BN类\n",
    "bn1 = nn.BatchNorm1d(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-1.2247, -1.2247, -1.2247],\n",
       "        [ 0.0000,  0.0000,  0.0000],\n",
       "        [ 1.2247,  1.2247,  1.2247]], grad_fn=<NativeBatchNormBackward0>)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn1(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0., 1., 2.],\n",
       "        [3., 4., 5.],\n",
       "        [6., 7., 8.]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-1., -1., -1.],\n",
       "        [ 0.,  0.,  0.],\n",
       "        [ 1.,  1.,  1.]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Z-Score计算结果\n",
    "(f - torch.mean(f, 0)) / torch.std(f, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-1.2247, -1.2247, -1.2247],\n",
       "        [ 0.0000,  0.0000,  0.0000],\n",
       "        [ 1.2247,  1.2247,  1.2247]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 样本方差计算结果\n",
    "(f - torch.mean(f, 0)) / torch.std(f, 0,unbiased=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> pytorch中默认的标准差计算是样本无偏估计，而BN再进行归一化计算时实际上采用的是样本方差，二者区别如下，样本标准差为：$$std(x) =  \\sqrt\\frac{\\sum^n_{i=1}(x_i-\\bar x)^2}{n}$$但对样本整体方差的无偏估计为：$$std_{unbiased}(x) = \\sqrt \\frac{\\sum^n_{i=1}(x_i-\\bar x)^2}{n-1}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;当然，根据此前的介绍，BN层和线性层有许多类似的地方，这里我们可以查看BN层的参数情况。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Parameter containing:\n",
       " tensor([1., 1., 1.], requires_grad=True),\n",
       " Parameter containing:\n",
       " tensor([0., 0., 0.], requires_grad=True)]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(bn1.parameters())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "据此也可确认在迭代开始之前，$\\gamma和\\beta$的初始值。同时，根据参数的可微性，我们也能看出其最终也是需要通过反向传播计算梯度，然后利用梯度下降进行求解的。当然，此处如果我们将affine参数设置为False，则无法查看参数，并且归一化过程就是简单的在无偏估计下对数据进行Z-Score归一化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "bn2 = nn.BatchNorm1d(3, affine=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(bn2.parameters())       # 此时无法查看bn2参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-1.2247, -1.2247, -1.2247],\n",
       "        [ 0.0000,  0.0000,  0.0000],\n",
       "        [ 1.2247,  1.2247,  1.2247]])"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn2(f)                       # 处理后的数据也不是可微的，无法进行反向传播，因此也无法修改BN参数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 训练数据均值与方差的迭代计算\n",
    "\n",
    "* 训练阶段的统计信息更新\n",
    "\n",
    "在训练阶段，对于每一个mini-batch，BN层都会计算出该mini-batch的均值和方差，并使用这些统计信息来归一化数据。同时，BN层也会更新整个训练集的均值和方差的估计值。这一过程通常通过移动平均（moving average）的方式来实现。\n",
    "\n",
    "* 计算mini-batch的统计信息\n",
    "\n",
    "对于一个mini-batch中的数据，假设mini-batch的大小为$ m $，特征维度为$ n $，那么mini-batch中的数据可以表示为 $ X = \\{x_{ij}\\}_{m \\times n} $，其中 $ x_{ij} $ 表示第 $ i $ 个样本的第 $ j $ 个特征。\n",
    "\n",
    "* 更新累计统计信息\n",
    "\n",
    "在训练过程中，BN层会使用一个移动平均来更新整个训练集的统计信息。移动平均公式如下：\n",
    "\n",
    "1. **更新均值的移动平均**：\n",
    "\n",
    "$$ \\mu_{\\text{moving}} \\leftarrow \\alpha \\cdot \\mu_{\\text{moving}} + (1 - \\alpha) \\cdot \\mu $$\n",
    "\n",
    "其中 $ \\alpha $ 是一个接近1的衰减因子（通常取值为0.9或0.999），$ \\mu $ 是当前mini-batch的均值。\n",
    "\n",
    "2. **更新方差的移动平均**：\n",
    "\n",
    "$$ \\sigma_{\\text{moving}}^2 \\leftarrow \\alpha \\cdot \\sigma_{\\text{moving}}^2 + (1 - \\alpha) \\cdot \\sigma^2 $$\n",
    "\n",
    "其中 $ \\sigma^2 $ 是当前mini-batch的方差。\n",
    "\n",
    "* 测试阶段使用累积统计信息\n",
    "\n",
    "在测试阶段，BN层会使用整个训练过程中累积的统计信息来进行归一化。这意味着在测试时，BN层会使用上述计算得到的 $ \\mu_{\\text{moving}} $ 和 $ \\sigma_{\\text{moving}}^2 $ 来归一化输入数据。\n",
    "\n",
    "* 实际应用中的实现\n",
    "\n",
    "在PyTorch中，BN层会自动处理这些计算和更新。你可以通过查看 `running_mean` 和 `running_var` 属性来了解累积的统计信息：\n",
    "\n",
    "```python\n",
    "# 获取BN层的累积统计信息\n",
    "running_mean = bn_layer.running_mean\n",
    "running_var = bn_layer.running_var\n",
    "```\n",
    "\n",
    "在训练过程中，这些属性会被自动更新。在测试阶段，BN层会使用这些累积的统计信息来进行归一化。\n",
    "\n",
    "* 总结\n",
    "\n",
    "通过使用移动平均来更新整个训练集的统计信息，BN层能够在训练过程中不断调整归一化的参数，并在测试阶段使用更准确的统计信息来进行归一化，从而提高模型的稳定性和泛化能力。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0., 1., 2.],\n",
       "        [3., 4., 5.],\n",
       "        [6., 7., 8.]])"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f = torch.arange(9).reshape(3, 3).float()\n",
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "bn1 = nn.BatchNorm1d(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "#help(bn1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0., 0., 0.])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn1.running_mean\n",
    "#bn1.running_var"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-1.2247, -1.2247, -1.2247],\n",
       "        [ 0.0000,  0.0000,  0.0000],\n",
       "        [ 1.2247,  1.2247,  1.2247]], grad_fn=<NativeBatchNormBackward0>)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn1(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0.3000, 0.4000, 0.5000])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn1.running_mean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([3., 4., 5.])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f.mean(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0.3000, 0.4000, 0.5000])"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn1.running_mean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "bn3 = nn.BatchNorm1d(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Parameter containing:\n",
       " tensor([1., 1., 1.], requires_grad=True),\n",
       " Parameter containing:\n",
       " tensor([0., 0., 0.], requires_grad=True)]"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(bn3.parameters())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([2.9453, 3.9270, 4.9088])\n",
      "tensor([2.9507, 3.9343, 4.9179])\n"
     ]
    }
   ],
   "source": [
    "print(bn3.running_mean)\n",
    "bn3(f)\n",
    "print(bn3.running_mean)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.30000000000000004"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "0*0.9+0.1*3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5700000000000001"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "0.3*0.9+0.1*3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([0.3000, 0.4000, 0.5000])\n",
      "tensor([0.5700, 0.7600, 0.9500])\n"
     ]
    }
   ],
   "source": [
    "print(bn3.running_mean)\n",
    "bn3(f)\n",
    "print(bn3.running_mean)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([3., 4., 5.])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.mean(f, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'bn3' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn[19], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mbn3\u001b[49m\u001b[38;5;241m.\u001b[39mrunning_mean \u001b[38;5;241m*\u001b[39m \u001b[38;5;241m0.9\u001b[39m \u001b[38;5;241m+\u001b[39m \u001b[38;5;241m0.1\u001b[39m \u001b[38;5;241m*\u001b[39m torch\u001b[38;5;241m.\u001b[39mmean(f, \u001b[38;5;241m0\u001b[39m)\n",
      "\u001b[1;31mNameError\u001b[0m: name 'bn3' is not defined"
     ]
    }
   ],
   "source": [
    "bn3.running_mean * 0.9 + 0.1 * torch.mean(f, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([1.8000, 1.8000, 1.8000])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn3.running_var * 0.9 + 0.1 * torch.var(f, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，在这种设置下，如果同一批数据，当迭代很多轮之后，running_mean/var就会最终取到sample_mean/var。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "能够发现，BN记录结果朝向整体真实结果靠拢。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 迭代记录的原因"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;至于为什么神经网络为什么采用迭代方法求解训练集特征的均值和方差，而不是直接直接按列进行统计。BN通过迭代求解的主要原因有两点，其一是面对超大规模数据时候求解均值或方差可能无法直接进行求解，尤其是如果数据是分布式存储的话一次性求解均值和方差就会更加困难；其二，迭代计算是深度学习求解参数的基本过程，所有深度学习框架为了更好地满足迭代计算，都进行了许多精巧的设计，从而使得迭代计算更加高效和平稳，而在计算训练集整体均值时，如果借助神经网络向前传播的过程进行迭代记录，便能够在控制边际成本的情况下快速达成目标。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- momentum参数设置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;关于momentum的设置，一般来说，为了尽可能获取到更加准确的训练集整体统计量，当每一个小批数据数据量比较小时，我们应该将历史数据比重调高，也就是降低momentum取值，以减少局部规律对获取总体规律的影响，当然，此时我们也需要增加遍历数据的次数epochs；而反之则可以考虑增大momentum取值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- track_running_stats参数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;对于BN来说，最后一个参数就是track_running_stats，默认取值为True。当此参数为True时，BN层会在每次迭代过程中会结合历史记录更新running_mean/var，当track_running_stats=False时，BN层将running_mean/var为None，并且在进行预测时会根据输入的小批数据进行均值和方差计算。在大多数情况下，并不推荐修改该参数的默认取值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### classwork 1\n",
    "\n",
    "* 设定一个4个特征5个样本的张量，并使用BN层进行归一化，并对比通过torch.mean的方式收到归一化的结果差异\n",
    "* 输出此bn层的参数和running_mean/running_var\n",
    "* 修改momentum参数，并观察结果\n",
    "* 通过循环运行50次归一，对比其running_mean/running_var与torch.mean的结果差异"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[ 0.,  1.,  2.,  3.],\n",
       "        [ 4.,  5.,  6.,  7.],\n",
       "        [ 8.,  9., 10., 11.],\n",
       "        [12., 13., 14., 15.],\n",
       "        [16., 17., 18., 19.]])"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f = torch.arange(20).reshape(5, 4).float()\n",
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [],
   "source": [
    "bn4= nn.BatchNorm1d(4, momentum=0.3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-1.4142, -1.4142, -1.4142, -1.4142],\n",
       "        [-0.7071, -0.7071, -0.7071, -0.7071],\n",
       "        [ 0.0000,  0.0000,  0.0000,  0.0000],\n",
       "        [ 0.7071,  0.7071,  0.7071,  0.7071],\n",
       "        [ 1.4142,  1.4142,  1.4142,  1.4142]],\n",
       "       grad_fn=<NativeBatchNormBackward0>)"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn4(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([ 8.0000,  9.0000, 10.0000, 11.0000])\n"
     ]
    }
   ],
   "source": [
    "for i in range(50):\n",
    "    bn4(f)\n",
    "print(bn4.running_mean)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Parameter containing:\n",
       " tensor([1., 1., 1., 1.], requires_grad=True),\n",
       " Parameter containing:\n",
       " tensor([0., 0., 0., 0.], requires_grad=True)]"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(bn4.parameters())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([ 8.,  9., 10., 11.])"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f.mean(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([2.4000, 2.7000, 3.0000, 3.3000])"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn4.running_mean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([12.7000, 12.7000, 12.7000, 12.7000])"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn4.running_var"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-1.4142, -1.4142, -1.4142, -1.4142],\n",
       "        [-0.7071, -0.7071, -0.7071, -0.7071],\n",
       "        [ 0.0000,  0.0000,  0.0000,  0.0000],\n",
       "        [ 0.7071,  0.7071,  0.7071,  0.7071],\n",
       "        [ 1.4142,  1.4142,  1.4142,  1.4142]])"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(f-f.mean(0))/f.std(0,unbiased=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BN层在神经网络中的应用\n",
    "\n",
    "###  model.train()与model.eval()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 伴随向前传播调整running_mean/var"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;尽管BN找到了能够计算训练集整体均值和方差的方法，但截至目前，还是没有解决如何将在训练过程中记录的running_mean/var应用到测试集当中的方法。核心原因在于，BN层的running_mean/var是伴随向前传播同步调整的。也就是说，在track_running_stats开启时，只要进行一次向前传播，就会更新一次running_mean/var，这明显是不合适的（因为我们不能根据测试集数据修改参数）。但是我们也知道，只要要进行测试集的测试，就一定需要模型执行测试数据的向前传播以得出模型结果，此时应该怎么办呢？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- model.train()与model.eval()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;此时就要用到PyTorch中适用于nn模块中所有模型的一种方法——model.train()与model.eval()。其中，model.train()表示开启模型训练模式，在默认情况下，我们实例化的每一个模型都是出于训练模式的，而model.eval()则表示将模型转化为测试模式。我们可以简单查看该方法的相关说明。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [],
   "source": [
    "bn5 = nn.BatchNorm1d(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn5.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然，开始训练模式时参数取值为False则代表开启测试模式，即bn5.train(False)和bn5.eval()等效。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;那么，训练模式和测试模式的根本区别是什么？简单来说就是，在PyTorch中，其实有很多类似BN类一样，会在向前传播过程中直接自动修改模型参数，而当模型出于测试模式时，会避免这种情况发生，即模型出于测试模式时不会根据当前输入数据情况调整running_mean/var。并且，模型出于测试模式时，BN层会利用已经记录下的running_mean/var对数据进行归一化。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 在PyTorch中，训练模式和测试模式的区分，相当于是给模型提供了可设置的两种行为模式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0., 1., 2.],\n",
       "        [3., 4., 5.],\n",
       "        [6., 7., 8.]])"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn5 = nn.BatchNorm1d(3)\n",
    "bn5.train() # 进入训练模式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([0., 0., 0.])\n",
      "tensor([0.3000, 0.4000, 0.5000])\n"
     ]
    }
   ],
   "source": [
    "print(bn5.running_mean)\n",
    "bn5(f)\n",
    "print(bn5.running_mean)\n",
    "#bn5.running_var"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)"
      ]
     },
     "execution_count": 114,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn5.eval()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn5.train()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([1.4057, 1.8742, 2.3428])\n",
      "tensor([1.4057, 1.8742, 2.3428])\n"
     ]
    }
   ],
   "source": [
    "print(bn5.running_mean)\n",
    "bn5(f)\n",
    "print(bn5.running_mean)               "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 109,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn5.training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另外我们需要知道，当模型（model）由多个模块（module）共同集成时，各模块的状态由模型整体状态决定。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 利用nn.BatchNorm构建带BN的神经网络"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;有了对nn.BatchNorm类基本了解之后，我们来尝试构建带有BN层的神经网络模型。根据上一节的介绍，我们知道BN本质上是一种自适应的数据分布调整算法，同时我们也知道，数据分布的调整并不影响，因此我们可以在任何需要的位置都进行BN归一化。另外，根据Glorot条件，我们知道在模型构建过程中需要力求各梯度计算的有效性和平稳性，因此我们可以考虑在每一个线性层前面或者后面进行数据归一化处理。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;在实际模型构建过程中，在模型中添加BN层和添加线性层类似，我们只需要在自定义模型类的init方法中添加BN层，并在向前传播方法中确定BN层的调用方法即可。而在具体BN层位置选择方面，是放在线性层前面还是放在线性层后面，目前业内没有定论，既没有理论论证BN层的最佳位置，也没有严谨的实验证明在哪个位置放置BN层效果最好。因此我们大可根据自己的习惯，以及自身经验的判断选择BN层的位置。当然，为了实验方便，我们会同时创建可以前置或者后置BN层的模型。（这里我们将放在隐藏层前面（也就是线性层后面）称为BN层前置，放在隐藏层后面称为BN层后置。）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "from torch.utils.data import DataLoader, TensorDataset\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# 定义一个简单的带BN的神经网络模型\n",
    "class SimpleNetWithBN(nn.Module):\n",
    "    def __init__(self, input_dim=10, hidden_dim=20, output_dim=2):\n",
    "        super(SimpleNetWithBN, self).__init__()\n",
    "        self.fc1 = nn.Linear(input_dim, hidden_dim)\n",
    "        self.bn1 = nn.BatchNorm1d(hidden_dim)  # 添加BN层\n",
    "        self.fc2 = nn.Linear(hidden_dim, hidden_dim)\n",
    "        self.bn2 = nn.BatchNorm1d(hidden_dim)  # 添加BN层\n",
    "        self.fc3 = nn.Linear(hidden_dim, output_dim)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        x = torch.relu(self.bn1(self.fc1(x)))  # 先通过线性层再通过BN层\n",
    "        x = torch.relu(self.bn2(self.fc2(x)))  # 同样，先线性层再BN层\n",
    "        x = self.fc3(x)\n",
    "        x=torch.softmax(x,dim=1)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 生成模拟数据\n",
    "def generate_data(num_samples, input_dim, output_dim):\n",
    "    input_data = torch.randn(num_samples, input_dim)  # 生成输入数据\n",
    "    target_data = torch.randint(low=0, high=output_dim, size=(num_samples,))  # 生成目标标签\n",
    "    return input_data, target_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 参数设置\n",
    "num_samples = 100\n",
    "input_dim = 10\n",
    "hidden_dim = 20\n",
    "output_dim = 3\n",
    "\n",
    "# 生成模拟数据\n",
    "input_data, target_data = generate_data(num_samples, input_dim, output_dim)\n",
    "\n",
    "# 使用sklearn的train_test_split函数分割数据集\n",
    "train_inputs, test_inputs, train_targets, test_targets = train_test_split(\n",
    "    input_data.numpy(), target_data.numpy(), test_size=0.3, random_state=42\n",
    ")\n",
    "\n",
    "# 将numpy数组转换回PyTorch张量\n",
    "train_inputs = torch.from_numpy(train_inputs).float()\n",
    "test_inputs = torch.from_numpy(test_inputs).float()\n",
    "train_targets = torch.from_numpy(train_targets).long()\n",
    "test_targets = torch.from_numpy(test_targets).long()\n",
    "\n",
    "# 创建TensorDataset\n",
    "train_data = TensorDataset(train_inputs, train_targets)\n",
    "test_data = TensorDataset(test_inputs, test_targets)\n",
    "\n",
    "# 创建DataLoader\n",
    "train_loader = DataLoader(train_data, batch_size=10, shuffle=True)\n",
    "test_loader = DataLoader(test_data, batch_size=10, shuffle=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,\n",
       "        1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,\n",
       "        0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,\n",
       "        1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,\n",
       "        1, 0, 0, 1])"
      ]
     },
     "execution_count": 127,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch [1/10], Loss: 1.0679\n",
      "Epoch [2/10], Loss: 1.0736\n",
      "Epoch [3/10], Loss: 1.0769\n",
      "Epoch [4/10], Loss: 1.1231\n",
      "Epoch [5/10], Loss: 1.0656\n",
      "Epoch [6/10], Loss: 1.0681\n",
      "Epoch [7/10], Loss: 1.1423\n",
      "Epoch [8/10], Loss: 1.0976\n",
      "Epoch [9/10], Loss: 1.1334\n",
      "Epoch [10/10], Loss: 1.0872\n"
     ]
    }
   ],
   "source": [
    "\n",
    "\n",
    "# 实例化模型\n",
    "model = SimpleNetWithBN(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim)\n",
    "\n",
    "# 定义损失函数和优化器\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer = optim.SGD(model.parameters(), lr=0.01)\n",
    "\n",
    "# 训练模型\n",
    "num_epochs = 10\n",
    "for epoch in range(num_epochs):\n",
    "    model.train()  # 设置模型为训练模式\n",
    "    for inputs, targets in train_loader:\n",
    "        # 前向传播\n",
    "        outputs = model(inputs)\n",
    "        \n",
    "        # 计算损失\n",
    "        loss = criterion(outputs, targets)\n",
    "        \n",
    "        # 反向传播\n",
    "        optimizer.zero_grad()\n",
    "        loss.backward()\n",
    "        \n",
    "        # 更新权重\n",
    "        optimizer.step()\n",
    "    \n",
    "    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of the model on the test: 23.33%\n"
     ]
    }
   ],
   "source": [
    "# 测试模型\n",
    "model.eval()  # 设置模型为评估模式\n",
    "with torch.no_grad():  # 不需要计算梯度\n",
    "    correct = 0\n",
    "    total = 0\n",
    "    for inputs, labels in test_loader:\n",
    "        outputs = model(inputs)\n",
    "        #print(outputs)\n",
    "        _, predicted = torch.max(outputs.data, 1)\n",
    "        #print(predicted)\n",
    "        total += labels.size(0)\n",
    "        correct += (predicted == labels).sum().item()\n",
    "    \n",
    "    print(f'Accuracy of the model on the test: {100 * correct / total:.2f}%')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "torch24",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
